We can find data on the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies in the Northern Hemisphere at NASA’s Goddard Institute for Space Studies. The tabular data of temperature anomalies can be found here:
weather <-
  read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv",
           skip = 1,
           na = "***")

# inspect the data frame 'weather'
# glimpse(weather)
# select the year and the 12 month variables
weather_clean <- weather %>%
  select(1:13)

# convert the data frame to longer (tidy) format
tidyweather <- pivot_longer(weather_clean, cols = 2:13,
                            names_to = "Month", values_to = "delta")
tidyweather

## # A tibble: 1,716 × 3
## Year Month delta
## <dbl> <chr> <dbl>
## 1 1880 Jan -0.36
## 2 1880 Feb -0.5
## 3 1880 Mar -0.23
## 4 1880 Apr -0.29
## 5 1880 May -0.06
## 6 1880 Jun -0.16
## 7 1880 Jul -0.17
## 8 1880 Aug -0.25
## 9 1880 Sep -0.22
## 10 1880 Oct -0.3
## # … with 1,706 more rows
We plot the data using a time-series scatter plot, and add a trendline. To do that, we first need to create a new variable called date to ensure that the delta values are plotted chronologically.
tidyweather <- tidyweather %>%
  mutate(date = ymd(paste(as.character(Year), Month, "1")),
         month = month(date, label = TRUE),
         year = year(date))
ggplot(tidyweather, aes(x = date, y = delta)) +
  geom_point() +
  geom_smooth(color = "red") +
  theme_bw() +
  labs(title = "Weather Anomalies")

Use facet_wrap() to produce a separate scatter plot for each month, again with a smoothing line.
ggplot(tidyweather) +
  aes(x = date, y = delta) +
  geom_point() +
  geom_smooth(color = "red") +
  theme_bw() +
  labs(title = "Weather Changes by Month") +
  facet_wrap(~ month)

Delta, on the y-axis, represents the temperature deviation from the expected value. The red smooth lines are trend lines for the temperature changes: the steeper the line, the faster the rate of temperature increase.
Before 1950, we can see visually that the warming effect was more pronounced in January and February; for instance, the smooth line for January is much steeper than those for July and August.
The year 1950 marks a rough turning point, after which the smooth lines become much steeper for all months. Some months still appear to warm faster than others, but after 1950 these differences are not visually distinctive.
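As a rough check on this visual impression, we can sketch a comparison of warming rates. The code below (our own illustration, assuming the tidyweather data frame built above; the names monthly_slopes and period are ours) fits a simple linear trend of delta on Year for each month, before and after 1950, and reports the slopes in degrees per year:

```r
library(dplyr)

# slope of a simple linear trend per month and period;
# larger slopes indicate faster warming
monthly_slopes <- tidyweather %>%
  filter(!is.na(delta)) %>%
  mutate(period = if_else(Year < 1950, "pre-1950", "post-1950")) %>%
  group_by(month, period) %>%
  summarise(slope = coef(lm(delta ~ Year))[2], .groups = "drop")

monthly_slopes
```

This makes the comparison between, say, January and July concrete rather than purely visual.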
It is sometimes useful to group data into different time periods when studying historical data. For example, we often refer to decades such as the 1970s, 1980s and 1990s to refer to a period of time. NASA calculates a temperature anomaly as the difference from the base period of 1951-1980. The code below creates a new data frame called comparison that groups the data into five time periods: 1881-1920, 1921-1950, 1951-1980, 1981-2010 and 2011-present.
We remove data from before 1881 using filter. Then, we use the mutate function to create a new variable interval which contains information on which period each observation belongs to. We can assign the different periods using case_when().
comparison <- tidyweather %>%
  filter(Year >= 1881) %>% # remove years prior to 1881
  # create new variable 'interval', and assign values based on criteria below:
  mutate(interval = case_when(
    Year %in% c(1881:1920) ~ "1881-1920",
    Year %in% c(1921:1950) ~ "1921-1950",
    Year %in% c(1951:1980) ~ "1951-1980",
    Year %in% c(1981:2010) ~ "1981-2010",
    TRUE ~ "2011-present"
  ))

Now that we have the interval variable, we can create a density plot to study the distribution of monthly deviations (delta), grouped by the different time periods we are interested in. Set fill to interval to group and colour the data by time period.
ggplot(comparison) +
  aes(x = delta, fill = interval) +
  geom_density(alpha = 0.3) +
  theme_minimal(base_size = 18)

So far, we have been working with monthly anomalies. However, we might be interested in average annual anomalies. We can compute these using group_by() and summarise(), followed by a scatter plot to display the result.
# create yearly averages
average_annual_anomaly <- tidyweather %>%
  group_by(Year) %>% # group data by Year
  # summarise the mean delta;
  # use `na.rm = TRUE` to ignore NA (not available) values
  summarise(mean_delta = mean(delta, na.rm = TRUE))
# plot the data:
ggplot(average_annual_anomaly) +
  aes(x = Year, y = mean_delta) +
  geom_point() +
  # fit the best-fit line, using the LOESS method;
  # theme_bw() gives a white background and a black frame around the plot
  geom_smooth(method = "loess", color = "red") +
  theme_bw(base_size = 18) +
  labs(title = "Average Annual Anomalies Over Time",
       x = "Year", y = "Average Annual Temperature Change")

# choose the interval 2011-present
# what dplyr verb will you use?
formula_ci <- comparison %>%
  filter(interval == "2011-present") %>%
  # calculate summary statistics for temperature deviation (delta):
  # mean, SD, count, SE, lower/upper 95% CI
  summarise(mean = mean(delta, na.rm = TRUE),
            count = n(),
            SD = sd(delta, na.rm = TRUE),
            SE = SD / sqrt(count),
            # find the t critical value with n-1 degrees of freedom
            t_critical = qt(0.975, count - 1),
            margin_of_error = SE * t_critical,
            lower_95pct_CI = mean - margin_of_error,
            upper_95pct_CI = mean + margin_of_error)
# print out formula_ci
formula_ci

## # A tibble: 1 × 8
## mean count SD SE t_critical margin_of_error lower_95pct_CI upper_95p…¹
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.07 144 0.266 0.0222 1.98 0.0438 1.02 1.11
## # … with abbreviated variable name ¹upper_95pct_CI
# calculate the CI using the infer package (bootstrap simulation)
set.seed(1234)
boot_ci <- comparison %>%
  # select the interval of interest
  filter(interval == "2011-present") %>%
  # specify the variable of interest
  specify(response = delta) %>%
  # generate bootstrap samples
  generate(reps = 1000, type = "bootstrap") %>%
  # find the mean of each bootstrap sample
  calculate(stat = "mean") %>%
  # find the confidence interval
  get_confidence_interval(level = 0.95, type = "percentile")

# print out boot_ci
boot_ci

## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 1.02 1.11
The bootstrap interval agrees closely with the formula-based interval, as we would expect given the 144 monthly observations.
approval_polllist <- read.csv('https://projects.fivethirtyeight.com/biden-approval-data/approval_polllist.csv')

# use lubridate to parse the poll end dates (the default date format in R is ymd)
approval_polllist <- approval_polllist %>%
  mutate(enddate = mdy(enddate),
         year = year(enddate),
         month = month(enddate),
         week = isoweek(enddate))
approval_polllist <- approval_polllist %>%
  filter(year == 2022) %>% # keep only polls from 2022
  mutate(app_disapp = approve - disapprove) %>% # net rating for each poll
  group_by(week, subgroup) %>% # group by week, then subgroup, so that we can continue the analysis
  summarise(mean_approve_disapprove = mean(app_disapp), # mean of net ratings
            sd_rating = sd(app_disapp), # standard deviation
            count = n(), # number of polls in each week
            # t critical value: the larger the sample count, the closer this gets to 1.96
            t_critical = qt(0.975, count - 1),
            se_approve_disapp = sd_rating / sqrt(count), # standard error
            margin_of_error = t_critical * se_approve_disapp,
            rating_low = mean_approve_disapprove - margin_of_error, # ribbon bands
            rating_high = mean_approve_disapprove + margin_of_error)
ggplot(approval_polllist, aes(x = week, y = mean_approve_disapprove)) +
  geom_ribbon(aes(ymax = rating_high, ymin = rating_low, fill = subgroup)) +
  geom_line() +
  facet_grid(rows = vars(subgroup)) +
  labs(x = 'Week in 2022', y = '',
       title = "Biden's Net Approval Ratings in 2022",
       subtitle = "Weekly Data, Approve - Disapprove, %") +
  ylim(-25, -5) +
  xlim(0, 35)

Biden’s net approval ratings show similar trends across the three subgroups for the weeks in 2022.
url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"

# Download TFL data to a temporary file
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))

## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2022-09-06T12%3A41%3A48/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20220911%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20220911T204946Z&X-Amz-Expires=300&X-Amz-Signature=193aa70b7c43ddb9942e4610910d11499b295eac7b9c15f1f49d96c664168370&X-Amz-SignedHeaders=host]
## Date: 2022-09-11 20:51
## Status: 200
## Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
## Size: 180 kB
## <ON DISK> /var/folders/7x/psgggfkx7638wxs95nnmhr_00000gn/T//Rtmpzl6nt9/file173e8792507ec.xlsx
# Use read_excel to read it as a dataframe
bike0 <- read_excel(bike.temp,
                    sheet = "Data",
                    range = cell_cols("A:B"))

# parse dates to get year, month, and week
bike <- bike0 %>%
  clean_names() %>%
  rename(bikes_hired = number_of_bicycle_hires) %>%
  mutate(year = year(day),
         month = lubridate::month(day, label = TRUE),
         week = isoweek(day))

The second plot looks at percentage changes from the expected level of weekly rentals. The two grey shaded rectangles correspond to Q2 (weeks 14-26) and Q4 (weeks 40-52).
We use the mean, rather than the median, to calculate the expected rentals. TfL is likely to use these data for profit and revenue analysis, and because revenue scales with the total number of hires, the mean is the more accurate representation of expected bike-rental revenue.
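As a sketch of how the percentage-change calculation could work (assuming the bike data frame built above; the column names expected, actual and pct_change, and the choice of 2016-2019 as the base period, are our own illustration, not TfL's):

```r
library(dplyr)

# total rentals per week, per year
weekly_actual <- bike %>%
  filter(year >= 2016) %>%
  group_by(year, week) %>%
  summarise(actual = sum(bikes_hired), .groups = "drop")

# expected rentals: the mean weekly total over the base period
weekly_expected <- weekly_actual %>%
  filter(year <= 2019) %>%
  group_by(week) %>%
  summarise(expected = mean(actual), .groups = "drop")

# percentage change of each week's rentals from the expected level
weekly_change <- weekly_actual %>%
  left_join(weekly_expected, by = "week") %>%
  mutate(pct_change = (actual - expected) / expected * 100)
```

The pct_change column is what the ribbon plot described above would display, with geom_rect() supplying the grey Q2 and Q4 rectangles.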